Bagging and Boosting a Treebank Parser

Authors

  • John C. Henderson
  • Eric Brill
Abstract

Bagging and boosting, two effective machine learning techniques, are applied to natural language parsing. Experiments using these techniques with a trainable statistical parser are described. The best resulting system provides roughly as large of a gain in F-measure as doubling the corpus size. Error analysis of the result of the boosting technique reveals some inconsistent annotations in the Penn Treebank, suggesting a semi-automatic method for finding inconsistent treebank annotations.

1 Introduction

Henderson and Brill (1999) showed that independent human research efforts produce parsers that can be combined for an overall boost in accuracy. Finding an ensemble of parsers designed to complement each other is clearly desirable. The parsers would need to be the result of a unified research effort, though, in which the errors made by one parser are targeted with priority by the developer of another parser.

A set of five parsers which each achieve only 40% exact sentence accuracy would be extremely valuable if they made errors in such a way that at least two of the five were correct on any given sentence (and the others abstained or were wrong in different ways). 100% sentence accuracy could be achieved by selecting the hypothesis that was proposed by the two parsers that agreed completely.

In this paper, the task of automatically creating complementary parsers is separated from the task of creating a single parser. This facilitates study of the ensemble creation techniques in isolation. The result is a method for increasing parsing performance by creating an ensemble of parsers, each produced from data using the same parser induction algorithm.

2 Bagging and Parsing

2.1 Background

The work of Efron and Tibshirani (1993) enabled Breiman's refinement and application of their techniques for machine learning (Breiman, 1996). His technique is called bagging, short for "bootstrap aggregating".
In brief, bootstrap techniques and bagging in particular reduce the systematic biases many estimation techniques introduce by aggregating estimates made from randomly drawn representative resamplings of those datasets. Bagging attempts to find a set of classifiers which are consistent with the training data, different from each other, and distributed such that the aggregate sample distribution approaches the distribution of samples in the training set.

Algorithm 1: Bagging Predictors (Breiman, 1996)

Given: training set $\mathcal{L} = \{(y_i, x_i),\ i \in \{1 \ldots m\}\}$ drawn from the set $\Lambda$ of possible training sets, where $y_i$ is the label for example $x_i$; classification induction algorithm $\Psi : \Lambda \to \Phi$ with classification algorithm $\phi \in \Phi$ and $\phi : X \to Y$.

1. Create $k$ bootstrap replicates of $\mathcal{L}$ by sampling $m$ items from $\mathcal{L}$ with replacement. Call them $L_1 \ldots L_k$.
2. For each $j \in \{1 \ldots k\}$, let $\phi_j = \Psi(L_j)$ be the classifier induced using $L_j$ as the training set.
3. If $Y$ is a discrete set, then for each $x_i$ observed in the test set, $y_i = \mathrm{mode}(\phi_1(x_i) \ldots \phi_k(x_i))$. $y_i$ is the value predicted by the most predictors, the majority vote.

2.2 Bagging for Parsing

An algorithm that applies the technique of bagging to parsing is given in Algorithm 2. Previous work on combining independent parsers is leveraged to produce the combined parser. The rest of the algorithm is a straightforward transformation of bagging for classifiers. Exploratory work in this vein was described by Hajič et al. (1999).

Algorithm 2: Bagging A Parser

Given: A corpus (again as a function) $C : S \times T \to \mathbb{N}$, where $S$ is the set of possible sentences and $T$ is the set of trees, with size $m = |C| = \sum_{s,t} C(s,t)$, and parser induction algorithm $g$.

1. Draw $k$ bootstrap replicates $C_1 \ldots C_k$ of $C$ each containing $m$ samples of $(s,t)$ pairs randomly
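
The three steps of the classifier version of bagging (bootstrap replicates, one induced classifier per replicate, majority vote) can be sketched in Python as below. This is only an illustrative sketch, not the authors' implementation: the function names, the toy data, and the 1-nearest-neighbour learner standing in for the induction algorithm are all assumptions made for the example.

```python
import random
from collections import Counter


def bootstrap_replicate(training_set, rng):
    """Step 1: draw m items from the training set with replacement."""
    m = len(training_set)
    return [training_set[rng.randrange(m)] for _ in range(m)]


def induce_nearest_neighbour(replicate):
    """Hypothetical induction algorithm: 1-nearest neighbour on a scalar x."""
    def classify(x):
        label, _ = min(replicate, key=lambda pair: abs(pair[1] - x))
        return label
    return classify


def bag(training_set, induce, k, seed=0):
    """Return a predictor that takes the majority vote of k bagged classifiers."""
    rng = random.Random(seed)
    # Steps 1 and 2: induce one classifier per bootstrap replicate.
    classifiers = [induce(bootstrap_replicate(training_set, rng))
                   for _ in range(k)]

    def predict(x):
        # Step 3: the mode of the k predictions (ties go to the label seen first).
        votes = Counter(classifier(x) for classifier in classifiers)
        return votes.most_common(1)[0][0]

    return predict


# Toy data as (label, x) pairs, mirroring the (y_i, x_i) ordering in the text.
data = [("low", 0.0), ("low", 1.0), ("low", 2.0),
        ("high", 8.0), ("high", 9.0), ("high", 10.0)]
predict = bag(data, induce_nearest_neighbour, k=11)
print(predict(0.5), predict(9.5))
```

An odd `k` avoids ties between two labels. Any other induction function with the same shape (replicate in, classifier out) could be passed in place of the toy learner.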


Similar resources

Generating a Persian Phrase-Structure Treebank by Automatic Conversion

Treebanks are among the most important and useful resources for natural language processing tasks. Dependency and phrase-structure treebanks are the two best-known kinds. Many efforts have already been made to convert dependency structure to phrase structure. In this paper we study an approach to converting dependency structure to phrase structure, motivated by the lack of a large phrase-structure treebank for Persian. A...


Feature Engineering in Persian Dependency Parser

A dependency parser is one of the most important fundamental tools in natural language processing; it extracts the structure of sentences and determines the relations between words based on dependency grammar. Dependency parsing is well suited to free-word-order languages such as Persian. In this paper, a data-driven dependency parser has been developed with the help of a phrase-structure parser fo...


A Rule-Based Approach to Automatically Converting Dependency Parse Trees into Phrase-Structure Parse Trees for Persian

In this paper, an automatic method for converting a dependency parse tree into an equivalent phrase-structure tree is introduced for the Persian language. In the first step, a rule-based algorithm was designed. Then, Persian-specific dependency-to-phrase-structure conversion rules were merged into the algorithm. Subsequently, the Persian dependency treebank with about 30,000 sentences was used as an input ...


Boosting the creation of a treebank

We present the results of the experiment of bootstrapping a Treebank for Catalan by using a Dependency Parser trained with Spanish sentences. In order to save time and cost, our approach was to profit from the typological similarities between Catalan and Spanish to create a first Catalan data set quickly by (i) automatically annotating with a delexicalized Spanish parser, (ii) manually correcti...


Improving reservoir rock classification in heterogeneous carbonates using boosting and bagging strategies: A case study of early Triassic carbonates of coastal Fars, south Iran

An accurate reservoir characterization is a crucial task for the development of quantitative geological models and reservoir simulation. In the present research work, a novel view is presented on the reservoir characterization using the advantages of thin section image analysis and intelligent classification algorithms. The proposed methodology comprises three main steps. First, four classes of...




Publication date: 2000